Figure 1: Illumina Sequencing Workflow
Genome sequcencing data is often stored in one of two formats, FASTA and FASTQ text files. For example a FASTA files looks like the following:
Figure 2: FASTA file format
We can also store confidence or quality scores using a FASTQ format:
In order to translate FASTQ quality scores:
Figure 4: Fastq Encoding
And now converting to confidence probabilities:
Figure 5: Fastq encoding quality scores
FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides a simple way to do QC checks on raw sequence data:
Figure 5: Fastqc scores
Figure 6: Fastqc score distribution
Figure 7: Fastqc base distribution
Figure 8: Fastqc N distribution
Find the genomic Location of origin for the sequencing read. Software: Bowtie2, TopHat, STAR, Subread/Rsubread, many others!
Figure 9: Sequence read alignment
Here is quick tutorial on sequnce aligment.
The following userguide will be helpful for you:
http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf
Abraham Lincoln: “Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” (4 minutes indexing the genome, 2 minutes aligning the reads)
Note that you will rarely do this for human alignment. You will usually download an existing index given to you by others who have already done this work. You will do this often if you are aligning microbial reads, e.g. MTB or some other organism for which others have not already made your index for you.
buildindex(basename="genome/ucsc.hg19.chr1_120-150M",reference="genome/ucsc.hg19.chr1_120-150M.fasta.gz")
## WARNING: your system seems to be 32-bit. Rsubread supports 32-bit systems to a very limited level.
## It is highly recommended to run Rsubread on a 64-bit system to avoid errors.
##
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.12.3
##
## //================================= setting ==================================\\
## || ||
## || Index name : ucsc.hg19.chr1_120-150M ||
## || Index space : base space ||
## || Index split : no-split ||
## || Repeat threshold : 100 repeats ||
## || Gapped index : no ||
## || ||
## || Free / total memory : 11.1GB / 24.0GB ||
## || ||
## || Input files : 1 file in total ||
## || o ucsc.hg19.chr1_120-150M.fasta.gz ||
## || ||
## \\============================================================================//
##
## //================================= Running ==================================\\
## || ||
## || Check the integrity of provided reference sequences ... ||
## || There were 4800000 notes for reference sequences. ||
## || The notes can be found in the log file, 'genome/ucsc.hg19.chr1_120-150 ... ||
## || Scan uninformative subreads in reference sequences ... ||
## || 516 uninformative subreads were found. ||
## || These subreads were excluded from index building. ||
## || Estimate the index size... ||
## || 8%, 0 mins elapsed, rate=18922.0k bps/s ||
## || 16%, 0 mins elapsed, rate=26103.2k bps/s ||
## || 24%, 0 mins elapsed, rate=29874.7k bps/s ||
## || 33%, 0 mins elapsed, rate=32146.7k bps/s ||
## || 41%, 0 mins elapsed, rate=33710.9k bps/s ||
## || 49%, 0 mins elapsed, rate=34838.4k bps/s ||
## || 58%, 0 mins elapsed, rate=35686.4k bps/s ||
## || 66%, 0 mins elapsed, rate=36346.2k bps/s ||
## || 74%, 0 mins elapsed, rate=36882.6k bps/s ||
## || 83%, 0 mins elapsed, rate=33709.4k bps/s ||
## || 91%, 0 mins elapsed, rate=30610.5k bps/s ||
## || 3.0 GB of memory is needed for index building. ||
## || Build the index... ||
## || 8%, 0 mins elapsed, rate=3896.2k bps/s ||
## || 16%, 0 mins elapsed, rate=6322.5k bps/s ||
## || 24%, 0 mins elapsed, rate=7978.8k bps/s ||
## || 33%, 0 mins elapsed, rate=9178.5k bps/s ||
## || 41%, 0 mins elapsed, rate=10088.6k bps/s ||
## || 49%, 0 mins elapsed, rate=10804.1k bps/s ||
## || 58%, 0 mins elapsed, rate=11376.0k bps/s ||
## || 66%, 0 mins elapsed, rate=11850.6k bps/s ||
## || 74%, 0 mins elapsed, rate=12247.5k bps/s ||
## || 83%, 0 mins elapsed, rate=11620.6k bps/s ||
## || 91%, 0 mins elapsed, rate=10905.1k bps/s ||
## || Save current index block... ||
## || [ 0.0% finished ] ||
## || [ 10.0% finished ] ||
## || [ 20.0% finished ] ||
## || [ 30.0% finished ] ||
## || [ 40.0% finished ] ||
## || [ 50.0% finished ] ||
## || [ 60.0% finished ] ||
## || [ 70.0% finished ] ||
## || [ 80.0% finished ] ||
## || [ 90.0% finished ] ||
## || [ 100.0% finished ] ||
## || ||
## || Total running time: 0.2 minutes. ||
## || Index genome/ucsc.hg19.chr1_120-150M was successfully built. ||
## || ||
## \\============================================================================//
Note that this outputs results in a .bam file and not a .sam file
align(index="genome/ucsc.hg19.chr1_120-150M",readfile1="reads/R01_10_short500K.fq.gz",output_file="alignments/R01_10_short.bam", nthreads=4)
## WARNING: your system seems to be 32-bit. Rsubread supports 32-bit systems to a very limited level.
## It is highly recommended to run Rsubread on a 64-bit system to avoid errors.
##
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.12.3
##
## //================================= setting ==================================\\
## || ||
## || Function : Read alignment (RNA-Seq) ||
## || Input file : R01_10_short500K.fq.gz ||
## || Output file : R01_10_short.bam (BAM) ||
## || Index name : ucsc.hg19.chr1_120-150M ||
## || ||
## || ------------------------------------ ||
## || ||
## || Threads : 4 ||
## || Phred offset : 33 ||
## || Min votes : 3 / 10 ||
## || Max mismatches : 3 ||
## || Max indel length : 5 ||
## || Report multi-mapping reads : yes ||
## || Max alignments per multi-mapping read : 1 ||
## || ||
## \\============================================================================//
##
## //================ Running (20-Jun-2023 02:32:35, pid=26105) =================\\
## || ||
## || Check the input reads. ||
## || The input file contains base space reads. ||
## || Initialise the memory objects. ||
## || Estimate the mean read length. ||
## || The range of Phred scores observed in the data is [2,37] ||
## || Create the output BAM file. ||
## || Check the index. ||
## || Init the voting space. ||
## || Global environment is initialised. ||
## || Load the 1-th index block... ||
## || The index block has been loaded. ||
## || Start read mapping in chunk. ||
## || 5% completed, 1.8 mins elapsed, rate=229.7k reads per second ||
## || 11% completed, 1.8 mins elapsed, rate=250.9k reads per second ||
## || 18% completed, 1.8 mins elapsed, rate=258.1k reads per second ||
## || 25% completed, 1.8 mins elapsed, rate=262.2k reads per second ||
## || 31% completed, 1.8 mins elapsed, rate=264.6k reads per second ||
## || 38% completed, 1.8 mins elapsed, rate=266.4k reads per second ||
## || 45% completed, 1.8 mins elapsed, rate=267.3k reads per second ||
## || 51% completed, 1.8 mins elapsed, rate=268.3k reads per second ||
## || 58% completed, 1.8 mins elapsed, rate=268.8k reads per second ||
## || 65% completed, 1.8 mins elapsed, rate=269.6k reads per second ||
## || 69% completed, 1.8 mins elapsed, rate=3.2k reads per second ||
## || 73% completed, 1.8 mins elapsed, rate=3.3k reads per second ||
## || 76% completed, 1.8 mins elapsed, rate=3.5k reads per second ||
## || 79% completed, 1.8 mins elapsed, rate=3.7k reads per second ||
## || 83% completed, 1.8 mins elapsed, rate=3.8k reads per second ||
## || 86% completed, 1.8 mins elapsed, rate=4.0k reads per second ||
## || 89% completed, 1.8 mins elapsed, rate=4.1k reads per second ||
## || 93% completed, 1.8 mins elapsed, rate=4.2k reads per second ||
## || 96% completed, 1.8 mins elapsed, rate=4.4k reads per second ||
## || 99% completed, 1.8 mins elapsed, rate=4.5k reads per second ||
## || ||
## || Completed successfully. ||
## || ||
## \\==================================== ====================================//
##
## //================================ Summary =================================\\
## || ||
## || Total reads : 503,568 ||
## || Mapped : 376,964 (74.9%) ||
## || Uniquely mapped : 317,555 ||
## || Multi-mapping : 59,409 ||
## || ||
## || Unmapped : 126,604 ||
## || ||
## || Indels : 2,236 ||
## || ||
## || Running time : 1.8 minutes ||
## || ||
## \\============================================================================//
## R01_10_short.bam
## Total_reads 503568
## Mapped_reads 376964
## Uniquely_mapped_reads 317555
## Multi_mapping_reads 59409
## Unmapped_reads 126604
## Indels 2236
Note that Rsubread outputs a .bam file (bam = binary alignment map) and not a .sam file (sam = sequence alignment map). Here is some information about a .sam file:
https://en.wikipedia.org/wiki/SAM_(file_format)
Figure 10: SAM and BAM file format
https://samtools.github.io/hts-specs/SAMv1.pdf
To convert .sam to .bam or vice versa, a package called Rsamtools. Using Rsamtools, you can convert bam to sam as follows:
asSam("alignments/R01_10_short.bam", overwrite=T)
## [1] "alignments/R01_10_short.sam"
# To convert to bam:
#asBam("alignments/R01_10_short.bam")
Makes a system call to the Mac terminal to generate a .sam file
Now we can count reads hitting genes. Approaches/software:
Figure 11: Feature Counts
fCountsList = featureCounts("alignments/R01_10_short.bam", annot.ext="genome/genes.chr1_120-150M.gtf", isGTFAnnotationFile=TRUE)
## Warning: your system seems to be 32-bit. Rsubread supports 32-bit systems to a limited level only.
## We recommend that Rsubread be run on 64-bit systems to avoid any possible problems.
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.12.3
##
## //========================== featureCounts setting ===========================\\
## || ||
## || Input files : 1 BAM file ||
## || ||
## || R01_10_short.bam ||
## || ||
## || Paired-end : no ||
## || Count read pairs : no ||
## || Annotation : genes.chr1_120-150M.gtf (GTF) ||
## || Dir for temp files : . ||
## || Threads : 1 ||
## || Level : meta-feature level ||
## || Multimapping reads : counted ||
## || Multi-overlapping reads : not counted ||
## || Min overlapping bases : 1 ||
## || ||
## \\============================================================================//
##
## //================================= Running ==================================\\
## || ||
## || Load annotation file genes.chr1_120-150M.gtf ... ||
## || Features : 1918 ||
## || Meta-features : 104 ||
## || Chromosomes/contigs : 1 ||
## || ||
## || Process BAM file R01_10_short.bam... ||
## || Single-end reads are included. ||
## || Total alignments : 503568 ||
## || Successfully assigned alignments : 4561 (0.9%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Write the final count table. ||
## || Write the read assignment summary. ||
## || ||
## \\============================================================================//
featureCounts = cbind(fCountsList$annotation[,1], fCountsList$counts)
write.table(featureCounts, "alignments/R01_10_short.features.txt", sep="\t", col.names=FALSE, row.names=FALSE, quote=FALSE)
#install.packages("devtools")
#devtools::install_github("wevanjohnson/singleCellTK")
library(singleCellTK)
singleCellTK()
### open features_combined.txt
### and meta_data.txt